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Abstract 

/f -means  clustering  is  a  very  popular  clustering  technique  which  is  used  in  numerous  apphca- 
tions.  Given  a  set  of  n  data  points  in  and  an  integer  k,  the  problem  is  to  determine  a  set 
of  k  points  called  centers^  so  as  to  minimize  the  mean  squared  distance  from  each  data 
point  to  its  nearest  center.  A  popular  heuristic  for  A;-means  clustering  is  Lloyd’s  algorithm. 
In  this  paper  we  present  a  simple  and  efficient  implementation  of  Lloyd’s  A;-means  clustering 
algorithm,  which  we  call  the  hltering  algorithm.  This  algorithm  is  very  easy  to  implement, 
ft  differs  from  most  other  approaches  in  that  it  precomputes  a  kd-tree  data  structure  for  the 
data  points  rather  than  the  center  points.  We  estabhsh  the  practical  efficiency  of  the  hltering 
algorithm  in  two  ways.  First,  we  present  a  data-sensitive  analysis  of  the  algorithm’s  running 
time.  Second,  we  have  implemented  the  algorithm  and  performed  a  number  of  empirical  studies, 
both  on  synthetically  generated  data  and  on  real  data  from  apphcations  in  color  quantization, 
compression,  and  segmentation. 
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Abstract 

A"-means  clustering  is  a  very  popular  clustering  technique  which  is  used  in  numerous 
applications.  Given  a  set  of  n  data  points  in  and  an  integer  fc,  the  problem  is  to  de¬ 
termine  a  set  of  k  points  R*^,  called  centers,  so  as  to  minimize  the  mean  squared  distance 
from  each  data  point  to  its  nearest  center.  A  popular  heuristic  for  fc-means  clustering 
is  Lloyd’s  algorithm.  In  this  paper  we  present  a  simple  and  efficient  implementation 
of  Lloyd’s  fc-means  clustering  algorithm,  which  we  call  the  hltering  algorithm.  This 
algorithm  is  very  easy  to  implement,  ft  differs  from  most  other  approaches  in  that  it  pre¬ 
computes  a  kd-tree  data  structure  for  the  data  points  rather  than  the  center  points.  We 
establish  the  practical  efficiency  of  the  hltering  algorithm  in  two  ways.  First,  we  present 
a  data-sensitive  analysis  of  the  algorithm’s  running  time.  Second,  we  have  implemented 
the  algorithm  and  performed  a  number  of  empirical  studies,  both  on  synthetically  gen¬ 
erated  data  and  on  real  data  from  applications  in  color  quantization,  compression,  and 
segmentation. 
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1  Introduction 


Clustering  problems  arise  in  many  different  applications,  such  as  data  mining  and  knowl¬ 
edge  discovery  [16],  data  compression  and  vector  quantization  [21],  and  pattern  recogni¬ 
tion  and  pattern  classification  [13].  One  important  class  of  clustering  problems,  some¬ 
times  called  k-means  clustering^  is  the  following:  Given  a  set  of  n  data  points  in  real 
d-dimensional  space,  R*^,  and  an  integer  fc,  determine  a  set  of  k  points  in  R*^,  called 
centers^  so  as  to  minimize  the  mean  squared  distance  from  each  data  point  to  its  nearest 
center.  This  measure  is  often  called  the  squared-error  distortion  [21,  24],  and  this  type 
of  clustering  falls  into  the  general  category  of  variance-based  clustering  [23,  22]. 

The  fc-means  problem  is  closely  related  to  a  number  of  other  clustering  problems, 
such  as  the  Euclidean  k-medians  problem  [2,  26],  in  which  the  objective  is  to  minimize 
the  sum  of  distances,  and  the  geometric  k-center  problem  [1],  in  which  the  objective  is 
to  minimize  the  maximum  distance.  Also  see  Capolyeas  et  al.  [9]  and  Jain  and  Dubes 
[24]  for  other  generalizations  and  issues  in  geometric  clustering  problems. 

An  asymptotically  efficient  approximation  for  the  fc-means  clustering  problem  has 
been  presented  by  Matousek  [31],  but  the  very  large  constant  factors  involved  in  the 
construction  suggest  that  it  is  not  a  good  candidate  for  implementation.  One  of  the 
most  popular  heuristics  for  solving  the  fc-means  problem  is  based  on  a  simple  iterative 
scheme  for  finding  a  locally  minimal  solution.  This  algorithm  is  often  called  the  k-means 
algorithm  [28,  18].  There  are  a  number  of  variants  of  this  algorithm,  so  to  clarify  which 
version  we  are  using,  we  will  refer  to  it  as  Lloyd’s  algorithm.  (To  be  even  more  accurate, 
it  would  be  better  to  call  it  the  generalized  Lloyd’s  algorithm.,  since  Lloyd’s  original  result 
was  for  scalar  data  [27].) 

Here  is  how  Lloyd’s  algorithm  works.  Given  any  set  of  k  centers,  for  each  center  c^, 
let  Vi  denote  the  set  of  data  points  for  which  ci  is  the  nearest  neighbor.  (Equivalently, 
Vi  is  the  set  of  data  points  lying  in  the  Voronoi  cell  of  Cj  relative  to  the  set  of  centers.) 
Eor  the  next  iteration  of  the  algorithm,  replace  Cj  with  the  centroid  of  Vi.  These  two 
steps  are  repeated  until  some  convergence  conditions  have  been  met.  See  Eaber  [15]  for 
descriptions  of  other  variants  of  this  algorithm. 

Eor  points  in  general  position,  the  algorithm  will  eventually  converge  to  a  point  that  is 
a  local  minimum.  This  is  true  because  any  local  minimum  for  this  problem  corresponds  to 
a  centroidal  Voronoi  configuration  (see  [15,  12]).  However,  the  result  is  not  necessarily  a 
global  minimum.  See  [7,  30,  33,  35]  for  further  discussion  of  its  statistical  and  convergence 
properties.  Because  of  its  simplicity  and  flexibility  it  is  a  very  popular  algorithm  and  is 
widely  used  in  statistical  analysis,  in  spite  of  its  apparent  limitations.  Eor  example,  it 
can  be  used  in  conjunction  with  another  more  global  algorithm  for  generating  clusters  as 
a  post-processing  stage  to  improve  the  quality  of  the  final  results. 

One  problem  shared  by  all  the  various  fc-means  algorithms  is  that  they  take  a  long 
time  to  run.  There  are  two  reasons  for  this.  Eirst,  they  are  often  applied  in  moderate- 
to  high-dimensional  spaces,  and  computing  nearest  neighbors  efficiently  in  such  spaces 
(without  resorting  to  brute-force  search)  is  not  a  trivial  problem.  The  second  reason 
is  that  often  many  iterations  are  needed  until  the  algorithm’s  termination  conditions 
are  satisfied.  (In  fact  it  can  be  shown  that  when  the  algorithm  is  sufficiently  close  to 
the  local  minimum,  the  algorithm  converges  at  a  linear  rate  [12].)  The  most  common 
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way  to  improve  the  efficiency  of  Lloyd’s  algorithm  is  to  compute  nearest  neighbors  more 
efficiently,  say  by  preprocessing  the  center  points  at  the  start  of  each  stage  into  a  data 
structure  for  answering  nearest  neighbor  queries. 

In  this  paper  we  present  a  simple  and  efficient  implementation  of  Lloyd’s  algorithm, 
which  we  call  the  filtering  algorithm.  This  algorithm  was  described  briefly  as  part  of  a 
more  complex  kinetic-based  fc-means  algorithm  [25].  Unlike  the  kinetic  algorithm,  this 
algorithm  is  very  easy  to  implement,  and  only  requires  that  a  kd-tree  be  built  for  the 
data  points.  We  present  the  algorithm  and  the  underlying  data  structures  in  Section  2. 
The  hltering  algorithm  differs  from  most  other  approaches  in  that  it  precomputes  a  data 
structure  for  the  data  points,  rather  than  the  center  points.  Intuitively,  this  is  a  smart 
thing  to  do  because  (1)  the  data  points  do  not  vary  throughout  the  computation,  and 
hence  this  data  structure  does  not  need  to  be  recomputed  at  each  stage,  and  (2)  there 
are  typically  many  more  data  points  than  query  points,  and  hence  the  relative  advantage 
provided  by  preprocessing  is  greater.  We  should  emphasize  that  this  is  not  a  different 
clustering  method  from  Lloyd’s  algorithm,  but  rather  a  more  efficient  implementation  of 
the  algorithm. 

We  establish  the  practical  efficiency  of  the  hltering  algorithm  in  two  ways.  First, 
we  present  a  data-sensitive  analysis  of  the  algorithm’s  running  time.  The  algorithm 
may  perform  as  poorly  as  the  simple  brute-force  algorithm  in  the  worst  case,  but  we 
claim  that  in  the  typical  situations  where  the  algorithm  will  be  used  (in  particular  where 
clustering  is  present  in  the  data)  the  algorithm  runs  quite  efficiently.  Furthermore,  the 
efficiency  improves  as  the  clusters  become  more  isolated  from  each  other.  These  results 
are  presented  in  Section  3.  Second,  we  have  implemented  the  algorithm  and  performed 
a  number  of  empirical  studies,  both  on  synthetically  generated  data  and  on  real  data 
from  applications  in  color  quantization,  compression,  and  segmentation,  as  described  in 
Section  4. 

2  The  Filtering  Algorithm 

In  this  section  we  describe  the  hltering  algorithm.  We  begin  with  a  review  of  the  data 
structure  that  the  algorithm  uses.  The  structure  is  a  balanced  box-decomposition  tree 
(or  BBD-tree)  for  the  point  set.  The  BBD-tree  [4]  is  a  balanced  variant  of  a  number 
of  well-known  data  structures  based  on  hierarchical  subdivision  of  space  into  rectilinear 
regions.  Fxamples  of  this  class  of  structures  include  point  quadtrees  [34]  and  variants  [6], 
k-d  trees  [5],  and  the  (unbalanced)  box-decomposition  tree  (also  called  a  fair-split  tree) 
[8,  10,  36].  A  related  structure  is  the  BAR  tree  [14],  which  partitions  space  into  regions 
of  bounded  aspect  ratio  but  the  regions  need  not  be  rectilinear.  For  completeness  we 
review  the  BBD-tree,  emphasizing  the  elements  that  are  important  for  this  application. 
The  special  properties  guaranteed  by  the  BBD-tree  are  important  for  our  theoretical 
analysis.  In  practice,  the  simpler  kd-tree  is  usually  good  enough,  and  was  used  for  our 
experiments. 

We  begin  with  a  few  dehnitions.  By  a  rectangle  in  R'^  we  mean  a  d-fold  product  of 
closed  intervals  on  the  coordinate  axes.  The  size  of  a  rectangle  is  the  length  of  its  longest 
side.  We  dehne  a  box  to  be  a  rectangle  such  that  the  ratio  of  its  longest  to  shortest  side, 
called  its  aspect  ratio,  is  bounded  by  some  constant,  which  for  concreteness  we  will  take 
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to  be  2. 

Each  node  of  the  BBD-tree  is  associated  with  a  region  of  space  called  a  cell.  In 
particnlar,  dehne  a  cell  to  be  either  a  box  or  the  set-theoretic  difference  of  two  boxes, 
one  enclosed  within  the  other.  Thns  each  cell  is  dehned  by  an  outer  box  and  an  optional 
inner  box.  Each  cell  is  associated  with  the  set  of  data  points  lying  within  the  cell.  Cells 
are  considered  to  be  closed,  and  hence  points  which  he  on  the  bonndary  between  cells 
may  be  assigned  to  either  cell.  The  size  of  a  cell  is  the  size  of  its  onter  box. 

The  following  two  lemmas  summarize  what  we  need  to  know  abont  the  BBD-tree  and 
its  properties.  See  [4]  for  proofs. 

Lemma  1  Given  a  set  of  n  data  points  P  in  and  a  bounding  hypercube  C  for  the 
points,  in  0{dn\ogn)  time  it  is  possible  to  construct  a  binary  tree  representing  a  hierar¬ 
chical  decomposition  of  C  into  cells  of  complexity  0[d)  such  that 

(i)  The  tree  has  0(n)  nodes  and  depth  O(logn). 

(ii)  The  cells  have  bounded  aspect  ratio,  and  with  every  2d  levels  of  descent  in  the  tree, 
the  sizes  of  the  associated  cells  decrease  by  at  least  a  factor  o/f/2. 

Lemma  2  (Packing  Constraint)  Consider  any  set  C  of  cells  of  the  BBD-tree  with  pair¬ 
wise  disjoint  interiors,  each  of  size  at  least  s,  that  intersect  a  ball  of  radius  r.  The  size 

of  such  a  set  is  at  most  ^f  +  • 

Now  we  describe  how  the  hltering  algorithm  implements  Lloyd’s  algorithm.  Recall 
onr  objective.  Eor  each  of  the  k  centers,  we  need  to  compnte  the  centroid  of  the  set 
of  data  points  for  which  this  center  is  closest.  After  this,  we  move  this  center  to  this 
centroid,  and  proceed  to  the  next  stage.  The  algorithm  begins  by  bnilding  a  BBD-tree  for 
the  data  points.  This  is  done  only  once  (not  repeated  at  each  stage)  since  the  data  points 
do  not  change  thronghont  the  compntation.  Eor  each  node  u  in  the  tree,  we  compnte  the 
centroid  of  its  associated  data  points,  which  is  then  weighted  (mnltiplied)  by  the  nnmber 
of  snch  points,  ft  is  easy  to  modify  the  BBD-tree  constrnction  to  compnte  this  additional 
information  in  the  same  time  bonnds. 

Next  we  consider  the  processing  for  each  individnal  stage.  Eor  each  node  of  the  BBD- 
tree  we  will  maintain  a  set  of  candidate  centers.  These  are  dehned  to  be  the  center  points 
that  might  serve  as  the  nearest  neighbor  for  some  point  lying  within  the  associated  cell. 
These  candidates  are  compnted  as  follows.  The  candidate  centers  for  the  root  consist  of 
all  k  centers.  We  then  hlter  the  candidates  down  nsing  the  following  recnrsive  prnning 
algorithm.  Eor  each  node  u\ei  C  =  cell(n)  denote  its  cell  and  let  Z  =  cand(n)  denote  its 
set  of  candidates.  We  compnte  the  candidate  z*  (2  Z  that  is  closest  to  the  center  of  C . 
Then  for  each  candidate  z  ^  Z  —  {z*},  ii  no  point  of  C  is  closer  to  2:  than  it  is  to  z*,  we 
can  infer  that  2:  is  not  the  nearest  center  to  any  data  point  associated  with  u,  and  hence 
we  can  eliminate  2:  from  the  list  of  candidates.  If  u  is  associated  with  a  single  candidate 
(which  mnst  be  z*)  then  2*  is  the  nearest  neighbor  of  all  its  data  points.  We  can  assign 
them  to  2*  by  adding  the  associated  weighted  centroid  to  an  accnmnlator  associated  with 
2*.  Otherwise,  if  u  is  an  internal  node,  we  recnrse  on  its  children.  If  n  is  a  leaf  node, 
we  compnte  the  distances  from  its  associated  data  point  to  all  the  candidates  in  Z,  and 
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assign  the  data  point  to  its  nearest  center.  To  aid  in  the  process,  we  make  nse  of  the 
following  ntility  procednre,  which  can  be  compnted  easily  in  0[d)  time.  Given  points  z*, 
z  and  cell  G,  closer(z*,  z,  C)  retnrns  false  if  if  any  part  of  C  is  closer  to  z  than  to  z*. 


Filter(n,Z)  { 

C  =  cell(n); 

z*  =  the  closest  point  in  Z  to  Cs  center; 
for  each  z  ^  Z  —  {z*}  { 

if  closer(z*,  z^C)  {  Z  —  =  {z};  } 

if((|Z|  ==  1)  V  (n  is  a  leaf))  z*.accum  +=  u. centroid] 

else  { 

Filter(n./e_/it,  Z)] 

Fi\teT[u.right,  Z)] 

} 

} 

} 


Fignre  1:  Filtering  Algorithm. 

The  hgnre  presents  a  more  detailed  description  of  the  recnrsive  procednre,  called 
Filter(n,Z).  Here  u  is  the  cnrrent  node  and  Z  are  the  candidates  passed  in  from  n’s 
parent.  The  initial  call  is  to  Filter(r,  Zq),  where  r  is  the  root  of  the  tree  and  Zq  is  the 
set  of  all  centers. 

3  Data-Sensitive  Analysis 

In  this  section  we  present  an  analysis  of  the  time  spent  in  each  stage  of  the  hltering 
algorithm.  Traditional  worst-case  analysis  is  not  appropriate  here,  since  it  is  not  difhcnlt 
to  concoct  scenarios  in  which  virtnally  all  of  the  nodes  in  the  BBD-tree  are  visited,  and  in 
which  every  node  is  associated  with  all  k  center  points  as  candidates.  Thns  the  rnnning 
time  is  0[kn)  in  the  worst  case,  which  is  no  better  than  the  simple  brnte-force  algorithm. 

The  intnition  behind  onr  analysis  is  based  on  the  observation  that  a  great  deal  of  the 
rnnning  time  of  Lloyd’s  algorithm  is  spent  in  the  later  stages  of  the  algorithm,  when  the 
center  points  are  close  to  their  hnal  locations,  bnt  the  algorithm  has  not  yet  converged. 
Becanse  of  the  method’s  relatively  slow  convergence  rate,  a  signihcant  fraction  of  the 
method’s  overall  rnnning  time  is  spent  in  this  mode.  Onr  analysis  will  be  based  on  the 
assnmption  that  the  data  set  can  indeed  be  clnstered  into  k  natnral  clnsters,  and  that 
the  cnrrent  centers  are  located  close  to  the  trne  clnster  centers.  These  are  admitedly 
strong  assnmptions,  bnt  onr  experimental  resnlts  will  bear  ont  that  the  algorithm  is 
qnite  efficient  even  when  these  assnmptions  are  not  met. 

We  say  that  a  node  is  visited  if  the  hlter  procednre  is  invoked  on  it.  An  internal  node 
is  expanded  if  its  children  are  visited.  A  nonexpanded  internal  node  or  a  leaf  node  is  a 
terminal  node.  Under  what  circumstances  is  a  node  a  terminal  node?  Consider  a  cell  of 
the  BBD-tree  and  the  set  of  candidate  centers  that  are  associated  with  this  cell.  Let  z* 
denote  the  closest  candidate  to  the  center  of  the  cell.  If  no  point  of  the  cell  is  closer  to 
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another  candidate  center  than  it  is  to  z*,  then  the  algorithm  can  infer  that  all  the  data 
points  contained  within  the  cell  can  be  assigned  to  z*,  and  hence  it  will  be  a  terminal 
node.  If  this  condition  is  never  satished,  then  the  decomposition  continnes  nntil  the 
leaves  are  reached,  at  which  point  the  process  will  certainly  terminate.  Thns,  a  nonleaf 
node  whose  cell  intersects  the  Voronoi  diagram  cannot  be  a  terminal  node. 

For  example,  in  Fig.  2  the  node  with  cell  a  is  terminal  becanse  it  lies  entirely  within 
the  Voronoi  cell  of  its  nearest  center.  Cell  b  is  terminal  becanse  it  is  a  leaf.  However,  cell 
c  is  not  terminal  becanse  it  is  not  a  leaf,  and  it  intersects  the  Voronoi  diagram.  Observe 
that  in  this  rather  typical  example  if  the  clnsters  are  well-separated,  or  isolated,  and 
the  center  points  are  close  to  the  trne  clnster  centers,  then  a  relatively  small  fraction 
of  the  data  set  lies  near  the  Voronoi  diagram  of  the  centers.  The  more  isolated  the 
clnsters  are,  the  smaller  this  fraction  will  be,  and  hence  the  fewer  of  the  cells  of  the 
search  strnctnre  will  need  to  be  visited  by  the  algorithm.  Onr  analysis  will  show  that  in 
snch  circnmstances,  the  hltering  search  algorithm  performs  very  well. 


Fignre  2:  Filtering  algorithm  analysis.  Note  that  the  density  of  points  near  the  edges  of 
the  Voronoi  diagram  is  relatively  low. 

In  onr  analysis  we  model  this  sort  of  sitnation  as  follows.  We  assnme  that  the  data 
points  are  generated  independently  from  k  different  mnltivariate  distribntions,  i.e,  prob¬ 
ability  distribntions  in  R*^.  Consider  any  one  snch  distribntion.  Let  X  =  (xi,  X2, .  .  .  ,  x^) 
denote  a  random  vector  from  this  distribntion. 

Let  jj,  G  denote  the  mean  point  of  this  distribntion  and  let  S  denote  the  d  X  d 
covariance  matrix  ioi  the  distribntion  [20],  that  is 

S  =  £((X-,i)(X-,if). 

Observe  that  the  diagonal  elements  of  S  are  the  variances  of  the  random  variables  that 
are  associated,  respectively,  with  the  individnal  coordinates.  Let  tr(S)  denote  the  trace 
of  S,  that  is,  the  snm  of  its  diagonal  elements.  We  will  measnre  the  dispersion  of  the 
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distribution  by  defining  a  =  Ytr(I]).  This  provides  a  generalization  of  the  notion  of 
standard  deviation  for  univariate  distributions. 

We  will  characterize  the  degree  of  isolation  of  the  clusters  by  two  parameters.  Let 
and  denote  the  mean  and  dispersion  of  the  fth  cluster  distribution.  Let 

Tmin  =  i  min  I  and  Umax  =  rnaxcr^®), 

Z  « 

where  1^  — p|  denotes  the  Euclidean  distance  between  points  q  and  p.  The  former  quantity 
is  half  the  minimum  distance  between  any  two  cluster  centers,  and  the  latter  is  the 
maximum  dispersion.  Intuitively,  isolated  clusters  of  points  should  have  a  large  value  of 
Tmin  ^nd  a  small  value  of  We  dehne  the  cluster  isolation  of  the  point  distribution 

to  be 

f  min 

P  =  - • 

^max 

This  measure  is  similar  to  other  known  measures  of  separatedness  or  isolation.  For 
example,  Coggins  and  .Jain  [11]  dehned  a  measure  of  isolation  for  a  single  cluster  i  as  the 
ratio  minj^j 

We  will  show  that,  assuming  that  the  candidate  centers  are  relatively  close  to  the 
cluster  means  then  as  p  increases  (as  clusters  are  more  isolated)  the  algorithm’s  run¬ 
ning  time  improves.  Our  proof  makes  use  of  the  following  straightforward  generalization 
of  Chebyshev’s  inequality  (see  [17])  to  multivariate  distributions. 

Lemma  3  Let  Ji.  be  a  random  vector  in  R'^  drawn  from  a  distribution  with  mean  p  and 
dispersion  a  (the  square  root  of  the  trace  of  the  covariance  matrix).  Then  for  all  positive 

t, 

Pr(|X  -  p\>  ta)  < 

Proof:  Let  (xi,  X2, .  .  . ,  and  (//i,  p2,  ■  ■  ■  ■,  Pd)  denote  the  coordinates  of  X  and  //,  re¬ 
spectively.  Also,  for  1  <  1  <  d,  let  cr|  =  E[[xj  —  PjY).  (Note  that  in  this  proof,  pj 
refers  to  a  single  coordinate  of  the  vector  //,  as  opposed  to  p(^\  which  stands  for  the  jth 
distribution.)  Thus  =  J2j=i  From  Chebyshev’s  inequality  [17],  for  1  <  j  <  d 

Pr(|xj  -pjj  >  taj)  < 

Summing  over  all  j  we  have 

^Ti{\xj  -  pY  >  taY  =  Y.Ti{{xj  -  pj)^  >fa))  < 

j=l  3  =  1  ^ 

Observe  that  for  any  random  variables  X  and  Y  and  constants  a  and  &, 

Pr(X  +  Y  >  a  +  &)  <  Pr(X  >  a)  +  Pr(Y  >  b). 

This  follows  from  the  fact  that  ifX  +  Y  >a  +  &  then  either  X  >  a  or  Y  >  b.  Applying 
this  to  the  summation  above  we  have 
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The  first  sum  is  just  the  squared  distance  between  X  and  and  the  second  snm  is  the 
trace  of  the  covariance  matrix  for  the  distribntion.  Taking  sqnare  roots  on  both  sides  we 
have 

Pr(|X  -  IJ\>  ta)  < 

which  completes  the  proof.  □ 

For  ^  >  0,  we  say  that  a  set  of  candidates  is  S -close  with  respect  to  a  given  set  of 
clnsters,  if  for  each  center  there  is  an  associated  clnster  mean  //d)  within  distance  at 
most  Arinin  and  vice  versa.  Here  is  the  main  resnlt  of  this  section. 

Theorem  1  Consider  a  set  of  n  points  in  drawn  from  a  collection  of  cluster  distri¬ 
butions  with  cluster  isolation  p  =  rmin/cTmax,  o-nd  consider  a  set  of  k  candidate  centers 
that  are  8-close  to  the  cluster  means,  for  some  ^  <  1.  Then  for  any  0  <  e  <  1  —  8  and 
for  some  constant  c,  the  expected  number  of  nodes  visited  by  the  filtering  algorithm  is 

Before  giving  the  proof  let  ns  make  a  few  observations  abont  this  resnlt.  The  resnlt 
bonnds  the  nnmber  of  nodes  visited.  The  time  needed  to  process  each  node  is  proportional 
to  the  nnmber  of  candidates  for  the  node,  which  is  at  most  k.  Thns  the  total  rnnning 
time  of  the  algorithm  is  larger  by  at  most  a  factor  of  k.  (Onr  experiments  indicate  that 
the  average  nnmber  of  candidates  per  node  is  typically  a  small  constant,  thongh.)  Also, 
the  total  rnnning  time  inclndes  an  additive  contribntion  of  O(dnlogn)  time  needed  to 
bnild  the  initial  BBD-tree. 

We  note  that  for  hxed  d,  p,  and  e  (bonnded  away  from  0  and  1  —  ^),  the  nnmber  of 
nodes  visited  as  a  fnnction  of  n  is  0(n)  (since  the  last  term  in  the  analysis  dominates). 
Thns  the  overall  rnnning  time  is  0[kn),  which  seems  to  be  no  better  than  the  brnte-force 
algorithm.  The  important  observation  is  that  as  clnstering  isolation  (p)  increases,  the 
rnnning  time  is  0(kn j p^),  and  thns  the  algorithm’s  efficiency  increases  rapidly  as  clnster 
isolation  increases.  In  Section  4  we  will  see  that  the  algorithm  is  qnite  efficient,  even  if 
clnster  isolation  is  not  particnlarly  strong. 

Proof:  Consider  the  tth  clnster  distribntion.  Let  denote  a  candidate  center  that  is 
within  distance  8rm\^^  from  its  mean  Let  denote  its  dispersion.  Let  B^®)  denote 
a  ball  of  radins  rmin  centered  at  Observe  that  becanse  no  clnster  centers  are  closer 
than  2rinin,  these  balls  have  pairwise  disjoint  interiors.  Let  M®)  denote  a  ball  of  radins 
’"min(l  —  e  —  8)  centered  at  c(®h  Becanse  c^®)  is  Bclose  to  pi,  M®)  is  contained  within  a  ball 
of  radins  r,niTi(l  —  e)  centered  at  pC\  and  hence  is  contained  within  Moreover,  the 

distance  between  the  bonndaries  of  B^®)  and  M®)  is  at  least  ermin. 

Consider  a  visited  node  and  let  C  denote  its  associated  cell  in  the  BBD-tree.  We 
consider  two  cases.  First,  snppose  that  for  some  distribntion  i,  C  0  bi  ^  We  call  C  a 
close  node.  Otherwise,  if  the  cell  does  not  intersect  any  of  the  bi  balls  it  is  a  distant  node. 

First  we  bonnd  the  nnmber  of  distant  visited  nodes.  Consider  the  snbtree  of  the 
BBD-tree  indnced  by  these  nodes.  If  a  node  is  distant,  then  its  children  are  both  distant, 
implying  that  this  indnced  snbtree  is  a  fnll  binary  tree  (that  is,  each  nonleaf  node  has 
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exactly  two  children).  It  follows  that  the  total  number  of  nodes  in  the  induced  subtree 
is  not  more  than  twice  the  number  of  leaves  in  the  tree,  which  we  shall  bound  next. 

The  data  points  associated  with  all  the  leaves  of  the  induced  subtree  cannot  exceed  the 
number  of  data  points  that  lie  outside  b.  As  observed  above,  for  any  cluster  distribution 
i,  a  data  point  of  this  distribution  that  lies  outside  M®)  is  at  distance  from  at  least 

f’minll  -  e)  =  -^(1  -  >  /o(l  - 

By  Lemma  3  the  probability  of  this  occuring  is  at  most  dj[p[l  —  e))^.  Thus,  the  expected 
number  of  such  data  points  is  at  most  dnj[p{l  —  e))^.  Since  each  leaf  of  the  BBD-tree 
contains  at  least  one  data  point,  and  each  point  can  be  associated  with  at  most  one  leaf, 
the  number  of  leaves  in  this  induced  subtree  is  bounded  by  the  same  quantity,  and  by 
the  above  observation,  the  total  number  of  distant  nodes  is  at  most  twice  as  large. 

Next  we  bound  the  number  of  close  visited  nodes.  A  visited  node  is  said  to  be 
expanded  if  the  algorithm  visits  its  children.  Clearly  the  number  of  close  visited  nodes 
is  proportional  to  the  number  of  close  expanded  nodes.  For  each  cluster  distribution 
C  consider  the  leaves  of  the  induced  subtree  consisting  of  close  expanded  nodes  that 
intersect  M®h  The  total  number  of  close  expanded  nodes  will  be  larger  by  a  factor  of  k. 
To  bound  the  number  of  such  nodes,  we  further  classify  them  by  the  size  of  the  associated 
cell.  An  expanded  node  v  whose  size  is  at  least  dcmin  is  large  and  otherwise  it  is  small. 
We  will  show  that  the  number  of  large  expanded  nodes  is  bounded  by  0(2*^  log  n)  and 
the  number  of  small  expanded  nodes  is  bounded  by  0((c\/d/e)'^),  for  some  constant  c. 

We  hrst  show  the  bound  on  the  number  of  large  expanded  nodes.  In  the  descent 
through  the  BBD-tree,  the  sizes  of  the  nodes  decrease  monotonically.  Consider  the  set 
of  all  expanded  nodes  of  size  greater  than  4r,niTi.  These  nodes  induce  a  subtree  in  the 
BBD-tree.  Let  L  denote  the  set  of  leaves  of  this  tree.  The  cells  associated  with  the 
elements  of  L  have  pairwise  disjoint  interiors  and  they  intersect  M®)  and  hence  intersect 
B(®h  It  follows  from  Lemma  2  (applied  to  5^®)  and  the  cells  associated  with  L)  that  there 
are  at  most  (1  +  [~4r,niTi /(4r,niTi)]  Y  =  2*^  such  cells.  By  Lemma  1  the  depth  of  the  tree  is 
O(logn),  and  hence  the  total  number  of  expanded  large  nodes  is  0(2*^  log  n),  as  desired. 

Next  we  consider  the  small  expanded  nodes.  We  assert  that  the  diameter  of  the 
cell  corresponding  to  any  such  node  is  at  least  ermin.  If  not,  this  cell  would  lie  entirely 
within  a  disk  of  radius  r,niTi(l  —  d)  of  c(®h  Since  the  cell  was  expanded,  we  know  that 
there  must  be  a  point  in  this  cell  that  is  closer  to  some  other  center  Cj.  This  implies 
that  the  distance  between  c^®)  and  is  less  than  2r,niTi(l  —  d).  Since  the  candidates 
are  ^-close,  it  follows  that  there  are  two  cluster  means  //^®)  and  that  are  closer  than 
2rinin(l  —  ^)  +  2^rinin  =  However,  this  would  contradict  the  dehnition  of  Tmin. 

Armed  with  this  observation,  we  apply  an  argument  similar  to  the  one  used  by  Arya 
and  Mount  [3]  to  bound  the  complexity  of  approximate  range  searching.  A  cell  in  dimen¬ 
sion  d  of  diameter  x  has  longest  side  length  at  least  xl\/d.  Thus  the  size  of  each  such 
cell  is  at  least  erinin/''/d-  It  suffices  to  count  the  number  of  expanded  nodes  of  sizes  from 
dcinin  down  to  eTjainl'sfd-  To  do  this  we  partition  nodes  into  groups  according  to  their 
size.  For  i  >  0,  dehne  size  group  i  to  be  the  set  of  nodes  whose  cell  size  is  in  the  interval 
(1/2®"*"^,  1/2®].  Since  small  nodes  are  of  size  less  than  dcmin,  the  hrst  size  group  of  interest 
is  a  +  f,  where  f/2®"’"^  <  dcmin  <  l/2“,  and  hence  a  =  [—  Ig  4r,niTiJ .  Since  nodes  that 
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are  smaller  than  errv,\r^/\/d  are  not  expanded,  the  last  size  gronp  of  interest  is  &,  where 
1/2^+^  <  erinin/\/d  <  1/2^,  and  hence  b  =  |^— lg(  e’"min/''/d)J  •  Becanse  a  node  and  its 
child  may  have  the  same  size,  we  cannot  apply  the  packing  lemma  directly  to  each  size 
gronp.  Dehne  the  base  group  for  the  iith  size  gronp  to  be  the  snbset  of  nodes  in  the  size 
gronp  that  are  leaves  or  whose  children  are  both  in  the  next  smaller  size  gronp.  The  cells 
corresponding  to  the  nodes  in  a  base  gronp  have  pairwise  disjoint  interiors,  since  none  of 
their  descendants  can  be  in  the  same  base  gronp.  Applying  Lemma  2,  it  follows  that  the 
nnmber  of  nodes  in  the  Ah  base  gronp  is  at  most 


1  + 


dCniin 

l/2» 


d 


From  claim  (ii)  of  Lemma  f  we  know  that  at  most  2d  levels  of  ancestors  above  the  base 
gronp  can  be  in  the  same  size  gronp,  and  thns  the  number  of  nodes  in  any  size  gronp  is 
at  most  2d  times  the  above  qnantity. 

Thns,  the  total  number  of  expanded  nodes  over  all  of  the  base  gronps  is 


£'d(rinin,  e)  <  ^  (l  +  [rinin2*+^j]  . 
Solving  this  geometric  series,  we  have 


for  some  constant  c.  This  completes  the  analysis  of  the  number  of  small  expanded  close 
nodes.  Snmming  this  bonnd  over  all  k  clnsters  provides  the  desired  resnlt. 

□ 


4  Empirical  Analysis 

To  establish  the  practical  efficiency  of  the  hltering  algorithm  we  implemented  it  and  tested 
it  on  a  nnmber  of  data  sets,  both  synthetically  generated  and  from  real  applications.  We 
implemented  a  simpler  variant  of  the  BBD-tree,  which  constrncts  a  kd-tree  nsing  the 
sliding  midpoint  rnle  [29].  This  decomposition  method  was  chosen  becanse  onr  stndies 
have  shown  that  it  performs  better  than  the  standard  kd-tree  decomposition  rnle  for 
clnstered  data  sets.  Data  points  were  chosen  from  a  clnstered  Ganssian  distribntion.  In 
particnlar,  a  given  nnmber  of  clnster  centers  were  sampled  from  a  nniform  distribntion 
over  the  hypercnbe  [— f ,  f]*^.  A  Ganssian  distribntion  was  generated  aronnd  each  center, 
in  which  each  coordinate  was  generated  independently  from  a  nnivariate  Ganssian  with 
a  given  standard  deviation. 

We  wanted  to  compare  onr  algorithm  against  existing  implementations  of  fc-means  al¬ 
gorithms,  and  Lloyd’s  algorithm  in  particnlar.  We  fonnd  this  difficnlt  to  do  for  a  nnmber 
of  reasons.  Many  algorithms  come  as  part  of  a  large  software  package  for  data  analysis, 
and  it  is  not  easy  to  modify  the  software  to  add  the  desired  instrnmentation.  Also  many 
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of  these  packages  do  not  provide  the  user  with  the  ability  to  control  termination  condi¬ 
tions  or  to  determine  the  initial  placements  of  centers.  Because  of  the  slow  convergence 
of  fc-means,  many  implementations  seek  to  speed  up  the  process  or  improve  its  clustering 
performance  for  the  application  of  interest  by  implementing  an  algorithm  that  is  different 
from  Lloyd’s  algorithm.  This  might  involve  modifying  the  metric,  by  computing  approxi¬ 
mate  nearest  neighbors,  or  by  applying  different  update  strategies.  Since  the  algorithm’s 
behavior  is  very  sensitive  to  the  placement  of  the  centers,  any  deviation  from  Lloyd’s 
algorithm  could  result  in  signihcantly  different  performance  on  a  given  data  set. 

Lloyd’s  algorithm  involves  two  basic  components,  computing  nearest  neighbors  for  the 
data  points  relative  to  the  centers,  and  moving  the  center  points  to  the  centroids  of  their 
neighborhoods.  Other  than  ours,  all  implementations  that  we  have  found  that  are  true  to 
this  method  differ  essentially  only  in  how  the  nearest  neighbors  are  computed.  For  our 
experiments,  we  implemented  two  common  methods  for  computing  nearest  neighbors. 
The  hrst  is  by  simple  brute  force^  computing  the  distance  from  each  data  point  to  every 
center.  The  second  is  by  building  a  nearest  neighbor  search  structure  for  the  centers. 
We  did  this  by  computing  a  kd-tree  for  the  centers  [t9],  which  we  call  the  kd-center 
algorithm.  This  was  done  using  the  ANN  library  [32],  using  the  same  decomposition 
method  mentioned  above.  Our  studies  compared  these  two  methods  to  the  hltering 
algorithm.  We  performed  two  sets  of  experiments,  one  involving  synthetic  data  and  the 
other  involving  data  derived  from  applications  in  image  segmentation  and  compression. 

Each  experiment  involved  running  all  three  algorithms:  the  brute-force  algorithm,  the 
kd-center  algorithm,  and  our  hltering  algorithm.  Even  for  the  same  data  set,  different 
starting  conhgurations  of  centers  result  in  different  numbers  of  stages.  However,  given 
the  same  data  set  and  the  same  starting  centers,  the  number  of  stages  is  identical  for  all 
three  algorithms.  Because  we  were  interested  in  comparing  the  relative  running  times 
of  the  three  algorithms,  we  chose  to  factor  out  the  number  of  stages,  by  dividing  the 
total  time  by  the  number  of  stages.  For  the  brute-force  and  kd-center  algorithms,  this  is 
quite  reasonable  since  the  processing  at  each  stage  is  virtually  identical.  However,  this 
introduces  a  bias  for  the  hltering  algorithm.  Since  the  total  running  time  includes  the 
one-time  cost  of  building  the  kd-tree  for  the  data  points,  by  averaging  over  the  stages,  we 
effectively  amortize  this  one-time  cost  over  the  stages.  If  the  algorithm  converges  after 
only  a  small  number  of  stages,  then  the  cost  per  stage  is  artihcially  increased  as  a  result. 

We  measured  running  time  in  two  different  ways.  We  measured  both  the  CPU  time 
(using  the  standard  clock ()  function),  and  a  second  quantity  called  the  number  of  node¬ 
candidate  pairs.  The  latter  quantity  is  a  machine-independent  statistic  of  the  algorithm’s 
complexity.  Intuitively,  it  measures  the  number  of  interactions  between  a  node  of  the  kd- 
tree  (or  data  point  in  the  case  of  brute  force)  and  a  candidate  center.  For  the  brute-force 
algorithm,  this  quantity  is  always  kn.  For  the  kd-center  algorithm,  for  each  data  point  we 
count  the  number  of  nodes  that  were  accessed  in  the  kd-tree  of  the  centers  for  computing 
its  nearest  center  and  sum  this  over  all  data  points.  For  the  hltering  algorithm,  we 
computed  a  sum  of  the  number  of  candidates  associated  with  every  visited  node  of  the 
tree.  In  particular,  for  each  call  to  Filter(n,  Z),  the  cardinality  of  the  set  Z  of  candidate 
centers  is  accumulated.  The  one-time  cost  of  building  the  kd-tree  is  not  included  in  this 
measure  (and  hence  the  amortization  issue  discussed  above  is  not  a  problem  here). 

Note  that  the  three  algorithms  are  functionally  equivalent.  They  all  produce  the  same 


to 


final  result,  and  so  there  is  no  issue  regarding  the  quality  of  the  hnal  output.  The  issue 
of  interest  for  us  is  the  efficiency  of  the  computation. 

4.1  Synthetic  Data 

We  ran  three  experiments  to  determine  the  variations  in  running  time  as  a  function 
of  cluster  isolation  and  data  set  size.  The  hrst  two  experiments  were  designed  to  test 
the  validity  of  Theorem  1.  We  generated  n  =  10,000  data  points  in  distributed 
evenly  among  50  clusters  from  a  clustered  Gaussian  distribution.  The  standard  deviation 
varied  from  0.01  (very  well  isolated)  up  to  0.7  (virtually  no  isolation).  Because  the  same 
distribution  is  used  for  cluster  centers  throughout,  the  expected  distances  between  cluster 
centers  will  be  constant  throughout.  Thus  the  expected  value  of  the  cluster  isolation 
parameter  p  varies  inversely  with  the  standard  deviation.  For  the  hrst  experiment  we 
chose  =  50  centers  for  each  run  and  for  the  second  we  chose  k  =  20.  The  initial  centers 
were  chosen  by  taking  a  random  sample  of  data  points. 

For  each  standard  deviation  we  ran  each  of  the  algorithms  (brute-force,  kd-center, 
and  hlter)  three  times.  For  all  runs  the  same  data  points  were  used.  For  each  of  the 
three  runs  a  new  set  of  initial  center  points  was  generated,  and  all  three  algorithms  were 
run  using  the  same  data  and  initial  center  points.  The  algorithm  ran  for  a  maximum  of 
30  stages  or  until  convergence.  The  reason  that  we  felt  it  safe  to  limit  the  number  of 
stages  is  that  the  running  times  tend  to  remain  constant  after  the  hrst  few  stages. 

The  average  CPU  times  per  stage  for  all  three  methods  are  shown  in  Fig.  3(a),  and 
the  average  number  of  node-candidate  pairs  per  stage  for  the  hltering  algorithm  is  shown 
in  (b).  The  same  results  are  shown  for  the  case  =  20  in  Fig.  4.  As  mentioned  above, 
if  the  algorithm  runs  for  only  a  small  number  of  stages,  a  signihcant  bias  is  introduced 
into  the  hltering  algorithm  to  account  for  the  time  needed  to  build  the  initial  kd-tree. 
This  bias  was  quite  noticeable  for  highly  clustered  instances  (small  standard  deviation). 
In  both  graphs  we  show  the  average  running  time  for  the  hltering  algorithm,  both  with 
and  without  the  initialization  bias  included.  For  purposes  of  comparison  with  the  results 
of  Theorem  1  the  times  without  initialization  are  the  more  relevant. 

The  results  of  these  experiments  show  that  the  hltering  algorithm  runs  signihcantly 
faster  than  the  other  two  algorithms.  Ignoring  the  initialization  bias,  its  running  time 
improves  when  the  clusters  are  more  isolated.  These  experiments  also  show  that  as  the 
standard  deviation  decreases  [p  increases)  the  running  time  for  the  hltering  algorithm 
decreases  rapidly. 

The  objective  of  the  third  experiment  was  to  study  the  effects  of  data  size  on  the 
running  time.  We  generated  data  points  of  various  sizes  in  a  clustered  Gaussian  distribu¬ 
tion,  again  with  50  clusters  and  with  k  =  50.  The  standard  deviation  was  hxed  at  0.10. 
The  data  size  varied  from  n  =  1,  000  to  n  =  20,  000.  Again,  the  searches  were  terminated 
when  stability  was  achieved  or  after  30  stages,  and  in  each  case  we  made  three  runs  with 
different  starting  centers  and  averaged  the  results.  The  CPU  times  per  stage  and  number 
of  node-candidate  pairs  are  shown  in  Fig.  5  for  all  three  methods. 

The  results  of  the  third  experiment  show  that  for  hxed  k  and  large  n,  all  three 
algorithms  have  running  times  that  vary  linearly  with  n,  with  the  hltering  algorithm 
doing  the  best. 
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Figure  3:  Average  CPU  time  and  node-candidate  pairs  per  stage  versus  cluster  standard 
deviation,  for  n  =  10,  000  and  k  =  50. 
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Standard  deviation  Standard  deviation 

Figure  4:  Average  CPU  time  and  node-candidate  pairs  per  stage  versus  cluster  standard 
deviation,  for  n  =  10,  000  and  k  =  20. 
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CPU  Seconds  per  Stage 


Figure  5:  Running  time  and  node-candidate  pairs  versus  data  size,  for  =  50  and 

(T  =  0.10. 


4.2  Real  Data 

To  understand  the  relative  efficiency  of  this  algorithm  under  more  practical  circum¬ 
stances,  we  ran  a  number  of  experiments  on  data  sets  derived  from  actual  applications  in 
image  processing  and  compression.  Each  experiment  involved  solving  a  fc-means  problem 
on  a  set  of  points  in  R'^  for  various  d’s.  In  each  case  we  applied  all  three  algorithms:  brute 
force  (Brute),  kd-center  (KDCen),  and  hltering  (Filter).  In  each  case  we  computed  the 
average  CPU  time  per  stage  in  seconds  (T-)  and  the  average  number  of  node-candidate 
pairs  (NC-).  This  was  repeated  for  values  of  k  chosen  from  {8,64,256}.  The  results  are 
shown  in  Table  1. 

The  experiments  are  grouped  into  three  general  categories.  The  hrst  involve  a  color 
quantization  application.  A  color  image  is  given  and  a  number  of  pixels  are  sampled  from 
the  image  (10,000  for  this  experiment).  Each  pixel  is  represented  as  a  triple  consisting  of 
red,  green  and  blue  components,  each  in  the  range  [0,255],  and  hence  is  a  point  in  R^. 
The  input  images  were  chosen  from  a  number  of  standard  images  for  this  application. 
The  results  are  shown  in  the  upper  half  of  the  table,  under  the  names  “balls,”  “kiss,” 

.  .  .  ,  “bam_s.” 

The  next  experiment  involved  a  vector  quantization  application  for  data  compression. 
Here  the  images  are  gray-scale  images  with  pixel  values  in  the  range  [0,255].  Each 
2x2  subarray  of  pixels  is  selected  and  mapped  to  a  4-element  vector,  and  the  fc-means 
algorithm  is  run  on  the  resulting  set  of  vectors.  These  are  shown  in  the  table  under  the 
names  “couple,”...,  “woman2.” 

The  hnal  experiment  involved  an  image  segmentation  problem.  A  512  x  512  LandSat 
image  of  Israel  consisting  of  7  spectral  bands  was  used.  The  resulting  7-element  vectors 
in  the  range  [0,  255]  were  presented  to  the  algorithms.  This  is  shown  in  the  table  under 
the  name  “Israel.” 

An  inspection  of  the  results  shows  that  the  hltering  algorithm  signihcantly  outper¬ 
formed  the  other  two  algorithms  in  all  cases.  Plots  of  the  underlying  point  distributions 
showed  that  most  of  these  data  sets  were  really  not  well  clustered.  Thus  the  hltering 
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Image 

Dim 

Size 

k 

T-Brute 

T-KDCen 

T-Filter 

NC-Brute 

NC-KDCen 

NC-Filter 

8 

0.0542857 

0.0804762 

0.0157143 

68.13 

3.678 

balls 

3 

10000 

64 

0.20875 

0.1515 

0.01875 

160.9 

17.65 

256 

0.742424 

0.17875 

0.0375758 

180 

49.77 

8 

0.053 

0.08225 

0.0165 

71.92 

10.07 

kiss 

3 

10000 

64 

0.209 

0.14875 

0.0435 

170.4 

41.76 

256 

0.7455 

0.218 

0.089 

243.9 

111 

8 

0.0538889 

0.0797222 

0.0163889 

80 

65.18 

9.767 

Lena 

3 

10000 

64 

0.20925 

0.14575 

0.046 

640 

164.7 

43.72 

256 

0.75025 

0.218 

0.0915 

2560 

236.8 

118.2 

8 

0.0535 

0.0755 

0.0095 

57.27 

3.859 

balllji 

3 

10000 

64 

0.207 

0.119118 

0.0245 

117.8 

20.98 

256 

0.749231 

0.16575 

0.0584615 

167.2 

69.42 

8 

0.0535 

0.07625 

0.0095 

80 

58.88 

4.073 

balll_3 

3 

10000 

64 

0.208571 

0.11641 

0.0251429 

640 

115.9 

20.69 

256 

0.748846 

0.166364 

0.0553846 

2560 

167.8 

65.83 

8 

0.0541667 

0.08 

0.0225 

62.72 

2.712 

balll_5 

3 

10000 

64 

0.214688 

0.125278 

0.029375 

122.4 

23.16 

256 

0.74775 

0.178 

0.059 

175.5 

71.89 

8 

0.0535714 

0.0778571 

0.0207143 

80 

64.37 

3.12 

balll_p 

3 

10000 

64 

0.2095 

0.122 

0.0285 

640 

122.2 

23.02 

256 

0.743043 

0.17 

0.0608333 

2560 

174.2 

72.93 

8 

0.0771429 

57.95 

3.497 

balll_s 

3 

10000 

64 

0.123939 

123.5 

24.04 

256 

0.17225 

176 

74.3 

8 

0.37475 

0.5575 

0.15825 

524.3 

482.3 

74.6 

couple 

4 

65536 

64 

1.616 

1.21925 

0.3535 

4194 

1452 

285.8 

256 

5.501 

1.732 

0.658 

1.678e+04 

2375 

739.6 

8 

0.36875 

0.57375 

0.146 

524.3 

497.7 

62.35 

crowd 

4 

65536 

64 

1.60975 

1.304 

0.32175 

4194 

1512 

244.9 

256 

5.49975 

1.716 

0.535 

1.678e+04 

2347 

567.3 

8 

0.36675 

0.59225 

0.18475 

524.3 

558.4 

99.04 

lax 

4 

65536 

64 

1.6085 

1.3725 

0.42325 

4194 

1772 

359.7 

256 

5.495 

1.86175 

0.7055 

1.678e+04 

2776 

790.6 

8 

0.373 

0.55475 

0.123 

524.3 

471.5 

50.83 

lena 

4 

65536 

64 

1.615 

1.19375 

0.34475 

4194 

1339 

266.9 

256 

5.50475 

1.649 

0.618 

1.678e+04 

2200 

677.1 

8 

0.3685 

0.5505 

0.13925 

524.3 

465.9 

63.16 

man 

4 

65536 

64 

1.609 

1.20275 

0.35375 

4194 

1392 

278.1 

256 

5.49975 

1.66125 

0.60675 

1.678e+04 

2275 

665.4 

8 

0.370937 

0.563125 

0.176875 

524.3 

483.4 

73.65 

womanl 

4 

65536 

64 

1.612 

1.28275 

0.3805 

4194 

1500 

304.2 

256 

5.523 

1.7945 

0.64325 

1.678e+04 

2450 

706.8 

8 

0.37175 

0.544 

0.1125 

524.3 

433.3 

35.46 

woman2 

4 

65536 

64 

1.615 

1.20075 

0.30775 

4194 

1277 

214.2 

256 

5.51675 

1.64675 

0.519 

1.678e+04 

2133 

531.4 

8 

1.7945 

2.739 

1.26775 

2097 

2170 

356.2 

Israel 

7 

262144 

64 

8.0195 

6.2145 

4.78825 

1.678e+04 

7520 

1581 

256 

28.5027 

9.76025 

7.324 

6.711e+04 

1.367e+04 

4434 

Table  1:  Running  times  for  various  test  inputs. 
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algorithm  is  quite  efficient,  even  when  the  conditions  of  Theorem  1  are  not  satished. 


5  Concluding  Remarks 

We  have  presented  an  efficient  implementation  of  Lloyd’s  fc-means  clustering  algorithm, 
called  the  hltering  algorithm.  The  algorithm  is  easy  to  implement  and  only  requires  that 
a  kd-tree  be  built  once  for  the  data  points.  Efficiency  is  achieved  because  the  data  points 
do  not  vary  throughout  the  computation,  and  hence  this  data  structure  does  not  need  to 
be  recomputed  at  each  stage,  and  there  are  typically  many  more  data  points  than  query 
points,  and  hence  the  relative  advantages  provided  by  preprocessing  are  greater.  This 
algorithm  differs  from  most  other  algorithms  only  in  how  nearest  centers  are  computed, 
so  it  could  be  applied  to  the  many  of  the  variants  of  Lloyd’s  algorithm. 

We  have  demonstrated  the  practical  efficiency  of  this  algorithm  both  theoretically 
through  a  data-sensitive  analysis  and  empirically  through  experiments  on  both  synthet¬ 
ically  generated  and  real  data  sets.  The  data-sensitive  analysis  shows  that  the  more 
well-isolated  are  the  clusters,  the  faster  the  algorithm  runs.  This  is  subject  to  the  as¬ 
sumption  that  the  center  points  are  indeed  close  to  the  cluster  centers.  Although  this 
assumption  is  rather  strong,  our  empirical  analysis  on  synthetic  data  indicates  that  the 
algorithm’s  running  time  does  improve  dramatically  as  cluster  isolation  increases.  These 
results  for  both  synthetic  and  real  data  sets  indicate  that  the  hltering  algorithm  is  sig- 
nihcantly  more  efficient  than  the  other  two  methods  that  were  tested. 

A  natural  question  is  whether  the  hltering  algorithm  can  be  improved.  The  most 
obvious  source  of  inefficiency  in  the  algorithm  is  that  it  passes  no  information  from  one 
stage  to  the  next.  Presumably  in  the  later  stages  of  Lloyd’s  algorithm,  as  the  centers 
are  converging  to  their  hnal  positions,  one  would  expect  that  the  vast  majority  of  the 
data  points  have  the  same  closest  center  from  one  stage  to  the  next.  A  good  algorithm 
would  exploit  this  coherence  to  improve  running  time.  A  kinetic  method  along  these 
lines  was  proposed  in  [25],  but  this  algorithm  is  quite  complex,  and  does  not  provide 
signihcantly  faster  running  time  in  practice.  The  development  of  a  simple  and  practical 
algorithm  which  combines  the  best  elements  of  the  kinetic  and  hltering  approaches  would 
be  a  signihcant  contribution. 

References 

[1]  P.  K.  Agarwal  and  C.  M.  Procopiuc.  Exact  and  approximation  algorithms  for  clus¬ 
tering.  In  Proc.  9th  ACM-SIAM  Sympos.  Discrete  Algorithms,  pages  658-667,  1998. 

[2]  S.  Arora,  P.  Raghavan,  and  S.  Rao.  Approximation  schemes  for  Euclidean  fc-median 
and  related  problems.  In  Proc.  30th  Annu.  ACM  Sympos.  Theory  Comput.,  pages 
106-113,  1998. 

[3]  S.  Arya  and  D.  M.  Mount.  Approximate  range  searching.  In  Proc.  11th  Annu.  ACM 
Sympos.  Comput.  Geom.,  pages  172-181,  1995. 

[4]  S.  Arya,  D.  M.  Mount,  N.  S.  Netanyahu,  R.  Silverman,  and  A.  Wu.  An  optimal 
algorithm  for  approximate  nearest  neighbor  searching.  .J.  ACM,  45:891-923,  1998. 


15 


[5]  J.  L.  Bentley.  Multidimensional  binary  search  trees  used  for  associative  searching. 
Commun.  ACM,  18:509-517,  1975. 

[6]  M.  Bern.  Approximate  closest-point  queries  in  high  dimensions.  Inform.  Process. 
Lett,  45:95-99,  1993. 

[7]  L.  Bottou  and  Y.  Bengio.  Convergence  properties  of  the  fc-means  algorithms.  In 
G.  Tesauro  and  D.  Touretzky,  editors.  Advances  in  Neural  Information  Processing 
Systems  7,  pages  585-592.  MIT  Press,  1995. 

[8]  P.  B.  Callahan  and  S.  R.  Kosaraju.  A  decomposition  of  multidimensional  point 
sets  with  applications  to  fc-nearest-neighbors  and  n-body  potential  helds.  .1.  ACM, 
42:67-90,  1995. 

[9]  V.  Capoyleas,  Gunter  Rote,  and  G.  Woeginger.  Geometric  clusterings.  .1.  Algorithms, 
12:341-356,  1991. 

[10]  K.  L.  Clarkson.  Fast  algorithms  for  the  all  nearest  neighbors  problem.  In  Proc.  2fth 
Annu.  IEEE  Sympos.  Eound.  Comput.  Sci.,  pages  226-232,  1983. 

[11]  .J.  M.  Coggins  and  A.  K.  .Jain.  A  spatial  hltering  approach  to  texture  analysis. 
Pattern  Recognition  Letters,  3:195-203,  1985. 

[12]  Q.  Du,  V.  Faber,  and  M.  Gunzburger.  Centroidal  Voronoi  tesselations:  Applications 
and  algorithms.  SIAM  Review,  41:637-676,  1999. 

[13]  R.  0.  Duda  and  P.  F.  Hart.  Pattern  Classification  and  Scene  Analysis.  John  Wiley 
&  Sons,  New  York,  NY,  1973. 

[14]  C.  Duncan,  M.  Goodrich,  and  S.  Kobourov.  Balanced  aspect  ratio  trees:  Combining 
the  advantages  of  k-d  trees  and  octrees.  In  Proc.  10th  ACM-SIAM  Sympos.  Discrete 
Algorithms,  pages  300-309,  1999. 

[15]  V.  Faber.  Clustering  and  the  continuous  fc-means  algorithm.  Los  Alamos  Science, 
22:138-144,  1994. 

[16]  U.  M.  Fayyad,  G.  Piatetsky-Shapiro,  P.  Smyth,  and  R.  Uthurusamy.  Advances  in 
Knowledge  Discovery  and  Data  Mining.  AAAI/MIT  Press,  1996. 

[17]  W.  Feller.  An  Introduction  to  Probability  Theory  and  its  Applications.  John  Wiley 
&  Sons,  New  York,  NY,  3rd  edition,  1968. 

[18]  F.  Forgey.  Cluster  analysis  of  multivariate  data:  Ffficiency  vs.  interpretability  of 
classihcation.  Biometrics,  21:768,  1965. 

[19]  J.  H.  Friedman,  J.  L.  Bentley,  and  R.  A.  Finkel.  An  algorithm  for  hnding  best 
matches  in  logarithmic  expected  time.  ACM  Trans.  Math.  Software,  3:209-226, 
1977. 


16 


[20]  K.  Fukunaga.  Introduction  to  Statistical  Pattern  Recognition.  Academic  Press, 
Boston,  MA,  1990. 

[21]  A.  Gersho  and  R.  M.  Gray.  Vector  Quantization  and  Signal  Compression.  Klnwer 
Academic,  Boston,  MA,  1992. 

[22]  M.  Inaba,  H.  Imai,  and  N.  Katoh.  Experimental  resnlts  of  a  randomized  clnstering 
algorithm.  In  Proc.  12th  Annu.  ACM  Sympos.  Comput.  Geom.,  pages  C1-C2,  1996. 

[23]  M.  Inaba,  N.  Katoh,  and  H.  Imai.  Applications  of  weighted  Voronoi  diagrams  and 
randomization  to  variance-based  fc-clnstering.  In  Proc.  10th  Annu.  ACM  Sympos. 
Comput.  Geom..,  pages  332-339,  1994. 

[24]  A.  K.  .Jain  and  R.  C.  Dnbes.  Algorithms  for  Clustering  Data.  Prentice  Hall,  Engle¬ 
wood  Cliffs,  NJ,  f988. 

[25]  T.  Kannngo,  D.  M.  Monnt,  N.  S.  Netanyahn,  C.  Piatko,  R.  Silverman,  and  A.  Y.  Wn. 
Compnting  nearest  neighbors  for  moving  points  and  applications  to  clnstering.  In 
Proc.  10th  Ann.  ACM-SIAM  Sympos.  Discrete  Algorithms^  pages  S931-S932,  1999. 

[26]  S.  Kollioponlos  and  S.  Rao.  A  nearly  linear-time  approximation  scheme  for  the 
Enclidean  fc-median  problem.  In  Proc.  7th  Annual  European  Sympos.  on  Algorithms^ 
1999. 

[27]  S.  P.  Lloyd.  Least  sqnares  qnantization  in  PCM.  IEEE  Trans,  on  Inf.  Theory, 
28:129-137,  1982. 

[28]  J.  MacQneen.  Some  methods  for  classihcation  and  analysis  of  mnltivariate  observa¬ 
tions.  In  Proc.  Pifth  Berkeley  Sympos.  on  Math.  Stat.  and  Prob.,  volnme  1,  pages 
281-296,  1967. 

[29]  S.  Maneewongvatana  and  D.  M.  Monnt.  Analysis  of  approximate  nearest  neighbor 
searching  with  clnstered  point  sets.  In  ALENEX,  1999. 

[30]  0.  L.  Mangasarian.  Mathematical  programming  in  data  mining.  Data  Mining  and 
Knowledge  Discovery,  1,  1997. 

[31]  J.  Matonsek.  On  approximate  geometric  fc-clnstering.  Mannscript,  1999. 

[32]  D.  M.  Monnt  and  S.  Arya.  ANN:  A  library  for  approximate  nearest  neighbor  search¬ 
ing.  Center  for  Geometric  Compnting  2nd  Annnal  Fall  Workshop  on  Compntational 
Geometry,  URL:  http://www.cs.umd.edu/~mount/ANN.,  1997. 

[33]  D.  Pollard.  A  centeral  limit  theorem  for  fc-means  clustering.  Annals  of  Probability, 
10:919-926,  1982. 

[34]  H.  Samet.  The  Design  and  Analysis  of  Spatial  Data  Structures.  Addison-Wesley, 
Reading,  MA,  1990. 


17 


[35]  S.  Z.  Selim  and  M.  A.  Ismail,  if-means-type  algorithms:  a  generalized  convergence 
theorem  and  characterization  of  local  optimality.  IEEE  Trans,  on  Pattern  Analysis 
and  Machine  Intelligence,  6:81-87,  1984. 

[36]  P.  M.  Vaidya.  An  O(nlogn)  algorithm  for  the  all-nearest-neighbors  problem.  Dis¬ 
crete  Comput.  Geom.,  4:101-115,  1989. 


18 


